Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug-Fix for invalid file descriptor validation and missing timeout in unix socket poll #281

Closed
wants to merge 3 commits into from

Conversation

alexander-mohr
Copy link
Contributor

@alexander-mohr alexander-mohr commented Dec 16, 2020

The poll timeout was only set for fifo.
TCP connections require this timeout as well.
Poll receives a timeout in milliseconds.
The given paramter was in nanoseconds.
An additional define adds the delay in miliseconds.

Some functions validated file descriptor to be greater 0.
If a process is started without stdin, stdout and stderr
the first file descriptor allocated by the process will be 0.
This also will be the case if the above mentioned file descriptors
will be closed on purpose.
As 0 is a valid fd, some methods had to be changed to reflect this.

The program was tested solely for our own use cases, which might differ from yours.
Licensed under Mozilla Public License Version 2.0

Alexander Mohr, [email protected], Daimler TSS GmbH, imprint

@alexander-mohr alexander-mohr changed the title Bug-Fix for invalid file descriptor validation and missing poll timeout in unix socket poll Bug-Fix for invalid file descriptor validation and missing timeout in unix socket poll Dec 16, 2020
@martin-ejdestig
Copy link
Contributor

martin-ejdestig commented Dec 17, 2020

The poll timeout was only set for fifo.
TCP connections require this timeout as well.

I am wondering a bit about this. Why is it needed? And I assume you mean UNIX/VSOCK and not TCP sockets...?

(Just happened to notice this pr and thought I would ask, since I added VSOCK support recently. I also change the timeout to be -1 unconditionally in #268 .)

@martin-ejdestig
Copy link
Contributor

martin-ejdestig commented Dec 17, 2020

Also note that timeout is (500 * 10^(6-3) / 3600)s ~= 138.8h for FIFOs, so effectively might as well have -1. So it is not really needed for FIFO:s either?

@alexander-mohr
Copy link
Contributor Author

I do mean UNIX/VSOCK. We had an issue on qnx where not setting this timeout would lead to blocking the thread in certain situations which lead to missing logs.
In our local build we actually set the timeout to 500ms instead of the timeout used in this PR.
Probably we should test if this is actually necessary. It might be enough to handle the file descriptors properly.
Nevertheless it might be a good idea to remove the ifdef and handle fifo and UNIX/VSOCK the same way.

@alexander-mohr
Copy link
Contributor Author

Can we make sure that a poll is returning if a connection is gone on all supported systems like qnx and linux or is it better to set a reasonable timeout for example for 1s to make sure this always returns?

@martin-ejdestig
Copy link
Contributor

martin-ejdestig commented Dec 17, 2020

Hum, I wonder what could make it "block" here and lead to missing messages. Messages are all received on the fd that is poll():ed? If fd is closed in other thread poll() should return?

I noticed #253 a while back. Perhaps there is a race here and the wrong fd ends up being polled?

Alexander Mohr added 2 commits December 17, 2020 12:21
The poll timeout was only set for fifo.
TCP connections require this timeout as well.
dlt_user_log_check_user_message also validated the file descriptor
to be greater 0.
Because 0 is a valid file descriptor this check has been changed
to greater or equal 0.
poll receives a timeout in milliseconds.
The given paramter was in nanoseconds.
An additional define takes adds the delay in miliseconds.

Signed-off-by: Alexander Mohr <[email protected]>
Some functions validated file descriptor to be greater 0.
If a process is started without stdin, stdout and stderr
the first file descriptor allocated by the process will be 0.
This also will be the case if the above mentioned file descriptors
will be closed on purpose.
As 0 is a valid fd, some methods had to be changed to reflect this.

Signed-off-by: Alexander Mohr <[email protected]>
@alexander-mohr
Copy link
Contributor Author

I've changed the timeout to 1s for now. I'm trying to find out if we can omit this timeout in general. I'll get back to as soon as I know more.

@alexander-mohr
Copy link
Contributor Author

alexander-mohr commented Dec 17, 2020

Let's assume the following scenario:
OS: QNX
At startup many applications will start before the dlt-daemon and start logging.
After the daemon is ready these applications will start writing theier local ring buffer into the dlt-daemon.
This creates a high load situation in the dlt-daemon.
In some cases an application is blocking in the poll and not returning.
Therefore the local ring buffer is never flushed and transmitted to the daemon.

Setting this timeout makes sure we always exit from this method and retry after the sleep in the housekeeper thread for example.
I was not able to debug this issue to get to the root cause of this lock up as it only happens in some special cases.
Especially is was not able to reproduce this at all on linux so it might be something that only happens on QNX.

What's your argument to omit this timeout?

@ssugiura
Copy link
Collaborator

@alexander-mohr @martin-ejdestig I assume you're correct; some timeout value needs to be set on poll.
There are mainly two things we need to handle in housekeeper thread; (1) handle log message from daemon (dlt_user_log_check_user_message), and (2) send messages which are stored in library internal buffer (dlt_user_log_resend_buffer).
As currently poll timeout is set to -1, it will be stuck in first handling, because we won't receive messages from daemon so frequently.
At the time we changed the value to -1, we wanted to reduce CPU usage of the library, but turned out to be wrong.
Triggered from this PR, we are currently doing performance measurement to come up with suitable timeout. In my person opinion, 1sec is too much, as we also have nanosleep in housekeeperthread_function, means next call will come up at maximum 1.5sec later.
We will get back to you soon if it's okay for you.

@alexander-mohr
Copy link
Contributor Author

Sounds great. We are also using a timeout of 500ms which seems kind of reasonable. But it's great that you are doing measurements to find the correct timeout.
Thanks!

@martin-ejdestig
Copy link
Contributor

martin-ejdestig commented Dec 18, 2020

Ah, dlt_user_log_resend_buffer() will not be called if poll() is not passed a timeout. I did not notice that. Agreed, a timeout to poll() that is not ~138 hours or -1 makes sense. :)

@ssugiura
Copy link
Collaborator

@alexander-mohr After our performance measurement, I suppose setting to 500msec looks good. So prefer the changes you made before to use DLT_USER_RECEIVE_MDELAY. If you agree, can you update as such? Thanks.

 Changed the poll timeout to 500ms after
 performance measurements resulted this
 as a good compromise.
 Introduced a define which is used in poll.

Signed-off-by: Alexander Mohr <[email protected]>
@alexmohr
Copy link
Contributor

alexmohr commented Jan 3, 2021

@alexander-mohr After our performance measurement, I suppose setting to 500msec looks good. So prefer the changes you made before to use DLT_USER_RECEIVE_MDELAY. If you agree, can you update as such? Thanks.

I've changed the code accordingly. Sorry for the late reply due to Christmas and new year.

@ssugiura
Copy link
Collaborator

ssugiura commented Jan 6, 2021

The changes were merged to master branch manually. Closing this PR.
Thanks @alexander-mohr for the fix!

@ssugiura ssugiura closed this Jan 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants